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Abstract 

Implementing Bayesian variable selection for linear Gaussian regression models for analysing 
high dimensional data sets is of current interest in many fields. In order to make such analysis oper- 
ational, we propose a new sampling algorithm based upon Evolutionary Monte Carlo and designed 
to work under the "large p, small n" paradigm, thus making fully Bayesian multivariate analysis 
feasible, for example, in genetics/genomics experiments. Two real data examples in genomics are 
presented, demonstrating the performance of the algorithm in a space of up to 10, 000 covariates. 
Finally the methodology is compared with a recently proposed search algorithms in an extensive 
simulation study. 
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1 Introduction 

This paper is a contribution to the methodology of Bayesian variable selection for linear Gaussian re- 
gression models, an important problem which has been much discussed both from a theoretical and 
a practical perspective (see Chipman et al, 2001 and Clyde and George, 2004 for literature reviews). 
Recent advances have been made in two directions, unravelUng the theoretical properties of different 
choices of prior structure for the regression coefficients (Fernandez et al, 2001; Liang et al, 2008) and 
proposing algorithms that can explore the huge model space consisting of all the possible subsets when 
there are a large number of covariates, using either MCMC or other search algorithms (Kohn et al. , 200 1 ; 
Dellaportas et al, 2002; Hans et al, 2007). 

In this paper, we propose a new sampUng algorithm for implementing the variable selection model, 
based on tailoring ideas from Evolutionary Monte Carlo (Liang and Wong, 2000; Jasra et al, 2007; 
Wilson et al., 2009) in order to overcome the known difficulties that MCMC samplers face in a high 
dimension multimodal model space: enumerating the model space becomes rapidly unfeasible even for 
a moderate number of covariates. For a Bayesian approach to be operational, it needs to be accompanied 
by an algorithm that samples the indicators of the selected subsets of covariates, together with any other 
parameters that have not been integrated out. Our new algorithm for searching through the model space 
has many generic features that are of interest per se and can be easily coupled with any prior formulation 
for the variance-covariance of the regression coefficients. We illustrate this by implementing ^-priors 
for the regression coefficients as well as independent priors: in both cases the formulation we adopt 
is general and allows the specification of a further level of hierarchy on the priors for the regression 
coefficients, if so desired. 

The paper is structured as follows. In Section 2, we present the background of Bayesian variable 
selection, reviewing briefly alternative prior specifications for the regression coefficients, namely p-priors 
and independent priors. Section 3 is devoted to the description of our MCMC sampler which uses a wide 
portfolio of moves, including two proposed new ones. Section 4 demonstrates the good performance of 
our new MCMC algorithm in a variety of real and simulated examples with different structures on the 
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predictors. In Section 4.2 we complement the results of the simulation study by comparing our algorithm 
with the recent Shotgun Stochastic Search algorithm of Hans et al. (2007). Finally Section 5 contains 
some concluding remarks. 

2 Background 
2.1 Variable selection 

Let y = (yi, . . . , be a sequence of n observed responses and Xi = {xn, . . . , XipY a vector of 
predictors for y^, z = 1, . . . , n, of dimension p x 1. Moreover let X be the n x p design matrix with iih 
row x^ . A Gaussian linear model can be described by the equation 

y = aln + X/3 + £, 

where a is an unknown constant, 1„ is a column vector of ones, /3 = . . . , fi^f is ap x 1 vector of 
unknown parameters and e ~ iV (O, cr^/n) . 

Suppose one wants to model the relationship between y and a subset of xi, . . . , Xp, but there is un- 
certainty about which subset to use. Following the usual convention of only considering models that 
have the intercept a, this problem, known as variable selection or subset selection, is particularly in- 
teresting when p is large and parsimonious models containing only a few predictors are sought to gain 
interpretabiUty. From a Bayesian perspective the problem is tackled by placing a constant prior density 
on a and a prior on ^ which depends on a latent binary vector 7 = (71, . . . , 7^)"^, where 7^ = 1 if 
/3j 7^ and 7^ = if /3j = 0, j = 1, . . . ,p. The overall number of possible models defined through 7 
grows exponentially with p and selecting the best model that predicts y is equivalent to find one over the 
TP subsets that form the model space. 

Given the latent variable 7, a Gaussian linear model can therefore be written as 

y = aln + X^^^ + £, (1) 

where is the non-zero vector of coefficients extracted from ^, is the design matrix of dimension 
n X p-y, p-y = 7^1p, with columns corresponding to 7^ = 1. We will assume that, apart from the intercept 
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a, xi, . . . ,Xp contains no variables that would be included in every possible model and that the columns 
of the design matrix have all been centred with mean 0. 

It is recommended to treat the intercept separately and assign it a constant prior: p{a) oc 1, 
Fernandez et al. (2001). When coupled with the latent variable 7, the conjugate prior structure of 
(^-y, (7^) follows a normal-inverse-gamma distribution 

p(^^|7,(72) =iv(m^,a2s^) (2) 

P (c^^ It) = P (o"^) = InvGa {aa, b^) (3) 

with acr,bcr > 0. Some guidehnes on how to fix the value of the hyperparameters and 6^ provided 
in Kohn et al. (2001), while the case Ua- = = corresponds to the Jeffreys' prior for the error 
variance, p (cr^) oc a~^. Taking into account (1), (2), (3) and the prior specification for a, the joint 
distribution of all the variables (based on further conditional independence conditions) can be written as 

P {y, 7, =p{y \l, Oi,^^,(j'^ )p (a) p {P^ I7, ) p (cr^) p (7) . (4) 

The main advantage of the conjugate structure (2) and (3) is the analytical tractability of the marginal 
hkeUhood whatever the specification of the prior covariance matrix E-y: 

J P {yll^ot: P-yj(^^) P ici)p {l^-y |7) o'^ ) P {f^"^) dadp^da^ 

oc \X^X^ + |S^rV2 (26, + S (^))-(2a.+n-l)/2 ^ (5) 

where S (j) = C - M^K~'^M, with C = {y - ynf {y - Vn) + m^T,-^m^, M = {y - y„) + 
E-^m^ and = X^X^ + E"^ (Brown et al, 1998). 

While the mean of the prior (2) is usually set equal to zero, m-y = 0, a neutral choice (Chipman et al., 
2001; Clyde and George, 2004), the specification of the prior covariance E-y matrix leads to at least two 
different classes of priors: 

• When E-y = gVy, where 5 is a scalar and Vy = (X^X^) ^, it replicates the covariance structure of 
the likelihood giving rise to so called 5-priors first proposed by Zellner (1986). 
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• When = cV^, but = Ip^ the components of are conditionally independent and the posterior 
covariance matrix is driven towards the independence case. 

We will adopt the notation S-y = rVy as we want to cover both prior specification in a unified manner. 
Thus in the 5-prior case, S-y = r (X^X-y) ^ while in the independent case, S-y = rlp^. We will refer 
to r as the variable selection coefficient for reasons that will become clear in the next Section. 

To complete the prior specification in (4), p (7) must be defined. A complete discussion about alter- 
native priors on the model space can be found in Chipman (1996) and Chipman et at. (2001). Here we 
adopt the beta-binomial prior illustrated in Kohn et at. (2001) 



^(7) = J p {j \uj) p {oj) (ko = 



B{p^ + a^,p- p^ + b^) 

with p-y = 7-^1^, where the choice p{'y\co) = u'^"' (1 — o;)^"^^ implicitly induces a binomial prior 
distribution over the model size and p (w) = u"'"'-^ (1 — u)''"~^ /B {a^, boj). The hypercoefficients a^^ 
and 6a, can be chosen once E (pj) and V (pj) have been elicited. In the "large p, small n" framework, 
to ensure sparse regression models where p^ <C p, it is recommended to centre the prior for the model 
size away from the number of observations. 

2.2 Priors for the variable selection coefficient r 

2.2.1 g'-priors 

It is a known fact that 51-priors have two attractive properties. Firstly they possess an automatic scaling 
feature (Chipman et al, 2001; Kohn et al, 2001). In contrast, for independent priors, the effect of 
Vy = Ip^ on the posterior distribution depends on the relative scale of X and standardisation of the 
design matrix to units of standard deviation is recommended. However, this is not always the best 
procedure when X is possibly skewed, or when the columns of X are not defined on a common scale of 
measurement. The second feature that makes ^-priors particularly appealing is the rather simple structure 
of the marginal likelihood (5) with respect to the constant r which becomes 

OC (1 + t)-P-'/^ {2b„ + S (^))-(2a<.+n-l)/2 ^ (7) 



where, if = 0, S (7) = {y - y„)^ (y - Vn) - {y - ynf {X'^X^) ^ {y - yn). For 
computational reasons explained in the next Section, we assume that (7) is always defined: since we 
calculate S (7) using the QR-decomposition of the regression (X-y, y — yn) (Brown et al., 1998), when 
n < Pj, S (7) = (y — Pn)'^ {y — yn) /(! + ''")• Despite the simpUcity of (7), the choice of the constant 
T for y-priors is complex, see Fernandez et al. (2001), Cui and George (2008) and Liang et al. (2008). 

Historically the first attempt to build a comprehensive Bayesian analysis placing a prior distribution 
on r dates back to Zellner and Siow (1980), where the data adaptivity of the degree of shrinkage adapts 
to different scenarios better than assuming standard fixed values. Zellner-Siow priors, Z-S hereafter, can 
be thought as a mixture of y-priors and an inverse-gamma prior on r, r ~ InvGa{l/2, n/2), leading to 

p{(3^\-f,a^) (X J N(o,a^T{X^X^)''^p{r)dT. (8) 

Liang et al. (2008) analyse in details Z-S priors pointing out a variety of theoretical properties. From 
a computational point of view, with Z-S priors, the marginal likelihood p (y I7 ) = J p{y\"f,T)p{T)dT 
is no more available in closed form, something which is advantageous in order to quickly perform a 
stochastic search (Chipman et al, 2001). Even though Z-S priors need no calibration and the Laplace 
approximation can be derived (Tiemey and Kadane, 1986), see Appendix A.2, never became as popular 
as gi-priors with a suitable constant value for r. For alternative priors, see also Cui and George (2008) 
and Liang et al. (2008). 

2.2.2 Independent priors 

When all the variables are defined on the same scale, independent priors represent an attractive alternative 
to g'-priors. The hkelihood marginalised over a, and becomes 

p (y |7) OC T-P-^/' \X^X, + Tl,^\-"' {2K + S (^))-(2«.+n-l)/2 ^ (9) 

where, if = 0, 5 (7) = (y - yn)^ (y - yn) - (y - Vn)^ X^ (^7 ^7 + ^hiY^ ^7 - Vn)- Note 
that (9) is computationally more demanding than (7) due to the extra determinant operator. 

Geweke ( 1 996) suggests to fix a different value of , j = 1 , . . . , p, based on the idea of "substantially 
significant determinant" of AXj with respect to Ay. However it is common practice to standardise the 
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predictor variables, taking r = 1 in order to place appropriate prior mass on reasonable values of the 
regression coefficients (Hans et al, 2007). Another approach, illustrated in Bae and Mallick (2004), 
places a prior distribution on Tj without standardising the predictors. 

Regardless of the prior specification for r, using the QR-decomposition on a suitable transformation 
of X-y and y — yn, the marginal likelihood (9) is always defined. 

3 MCMC sampler 

In this Section we propose a new sampling algorithm that overcomes the known difficulties faced by 
MCMC schemes when attempting to sample a high dimension multimodal space. We discuss in a unified 
manner the general case where a hyperprior on the variable selection coefficient r is specified. This 
encompasses the ^-prior and independent prior structure as well as the case of fixed r if a point mass 
prior is used. 

The multimodaUty of the model space is a known issue in variable selection and several ways to tackle 
this problem have been proposed in the past few years. Liang and Wong (2000) suggest an extension 
of parallel tempering called Evolutionary Monte Carlo, EMC hereafter, Nott and Green, N&G hereafter, 
(2004) introduce a sampling scheme inspired by the Swendsen-Wang algorithm while Jasra et al. (2007) 
extend EMC methods to varying dimension algorithms. Finally Hans et al. (2007) propose when p > n 
a new stochastic search algorithm, SSS, to explore models that are in the same neighbourhood in order 
to quickly find the best combination of predictors. 

We propose to solve the issue related to the multimodality of model space (and the dependence be- 
tween 7 and r) along the fines of EMC, applying some suitable parallel tempering strategies directly 
on p (y I7, r ). The basic idea of parallel tempering, PT hereafter, is to weaken the dependence of a 
function from its parameters by adding an extra one called "temperature". Multiple Markov chains, 
called "population" of chains, are run in parallel, where a different temperature is attached to each chain, 
their state is tentatively swap at every sweep by a probabihstic mechanism and the latent binary vector 
7 of the non-heated chain is recorded. The different temperatures have the effect of flatting the likeli- 
hood. This ensures that the posterior distribution is not trapped in any local mode and that the algorithm 

7 



mixes efficiently, since every chain constantly tries to transmit information about its state to the others. 
EMC extents this idea, encompassing the positive features of PT and genetic algorithms inside a MCMC 
scheme. 

Since (5 and cr^ are integrated out, only two parameters need to be sampled, namely the latent binary 
vector and the variable selection coefficient. In this set-up the full conditionals to be considered are 

\p{li\---t"' oc \p{y\iuT)f'^ \p{li)f'' (10) 

where L is the number of chains in the the population and ti, 1 = ti < t2 < ■ ■ ■ < is the temperature 
attached to the Ith chain while the population 7 corresponds to a set of chains that are retained simulta- 
neously. Conditions for convergence of EMC algorithms are well understood and illustrated for instance 
in Jasra et al. (2007). 

At each sweep of our algorithm, first the population 7 in (10) is updated using a variety of moves in- 
spired by genetic algorithms: "local moves", the ordinary MetropoUs-Hastings or Gibbs update on every 
chain; and "global moves" that include: i) selection of the chains to swap, based on some probabiUstic 
measures of distance between them; ii) crossover operator, i.e. partial swap of the current state between 
different chains; iii) exchange operator, full state swap between chains. Both local and global moves are 
important although global moves are crucial because they allow the algorithm to jump from one local 
mode to another. At the end of the update of 7, r is then sampled using (11). 

The implementation of EMC that we propose in this paper includes several novel aspects: the use 
of a wide range of moves including two new ones, a local move, based on the Fast Scan MetropoUs- 
Hastings sampler, particularly suitable when p is large and a bold global move that exploits the pattern 
of correlation of the predictors. Moreover, we developed an efficient scheme for tuning the temperature 
placement that capitalises the effective interchange between the chains. Another new feature is to use 
a Metropolis- within-Gibbs with adaptive proposal for updating r, as the full conditional (11) is not 
available in closed form. 
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3.1 EMC sampler for 7 

In what follows, we will only sketch the rationale behind all the moves that we found useful to implement 
and discuss further the benefits of the new specific moves in Section 4.1. For the "large p, small n" 
paradigm and complex predictor spaces, we believe that using a wide portfoUo of moves is needed and 
offers better guarantee of mixing. 

From a notational point of view, we will use the double indexing 7; j, I = 1, . . . ,L and j = 
1,. . . ,p to denote the jth latent binary indicator in the Ith chain. Moreover we indicate by ji = 
{11,1 J ■ ■ ■ 5 7/,p)^ ths vector of binary indicators that characterise the state of the Ith chain of the pop- 
ulation 7 = (71,..., 7l). 

Local moves and Fast Scan Metropolis Hastings sampler 

Given r, we first tried the simple MC^ idea of Madigan and York (1995), also used by Brown et al. 
(1998) where add/delete and swap moves are used to update the latent binary vector 7;. For an add/delete 
move, one of the p variables is selected at random and if the latent binary value is the proposed new 
value is 1 or vice versa. However, when p S> p-y; , where p-y^ is the size of the current model for the Zth 
chain, the number of sweeps required to select by chance a binary indicator with a value of 1 follows 
a geometric distribution with probabiUty p-y/p which is much smaller than 1 — p^/p to select a binary 
indicator with a value of 0. Hence, the algorithm spends most of the time trying to add rather than delete 
a variable. Note that this problem also affects RJ-type algorithms (Dellaportas et al., 2002). On the other 
hand, Gibbs sampUng (George and McCuUoch, G&McC hereafter, 1993) is not affected by this issue 
since the state of the Zth chain is updated by sampUng from 

[p{ll,3 = 1 b'Tij-,^)]^^*' exp|(^logp {y +logp(7Zj = 1 \ll,3-)) A^} ' (1^) 

where 7^ j- indicates for the Zth chain all the variables, but the jth, j = 1, . . . , p and 

= ■ ■ = 1, • • • , li,pV- The main problem related to Gibbs sampUng is the 

large number of models it evaluates if a full Gibbs cycle or any permutation of the indices is implemented 
at each sweep. Each model requires the direct evaluation, or at least the update, of the time consuming 



quantity S (7), equation (7) or (9), making practically impossible to rely solely on the Gibbs sampler 
when p is very large. However, as sharply noticed by Kohn et al. (2001), it is wasteful to evaluate all the 
p updates in a cycle because if p-yj is much smaller than p and given 7/ j = 0, it is likely that the sampled 
value of 7/j is again 0. 



When p is large, we thus consider instead of the standard MC"^ add/delete, swap moves, a novel 
Fast Scan MetropoUs-Hastings scheme, FSMH hereafter, speciaUsed for EMC/PT. It is computationally 
less demanding than a full Gibbs sampling on all 7;^^ and do not suffer from the problem highlighted 
before for MC^ and RJ-type algorithms when p S> . The idea behind the FSMH move is to use 
an additional acceptance/rejection step (which is very fast to evaluate) to choose the number of indices 
where to perform the Gibbs-like step: the novelty of our FSMH sampler is that the additional probabiUty 
used in the acceptance/rejection step is based not only on the current chain model size p^^, but also on 
the temperature ti attached to the Zth chain. Therefore the aim is to save computational time in the large 
p set-up when multiple chains are simulated in parallel and finding an alternative scheme to a full Gibbs 
sampler. To save computational time our strategy is to evaluate the time consuming marginal UkeUhood 



(5) in no more than approximately 



^; (1/^0 {P ~ Pi) + ^f^l {^hi)Pi times per cycle in the /th chain 
(assuming convergence is reached), where 6^^^ (1/^0 is the probabiUty to select a variable to be added 
in the acceptance/rejection step which depends on the current model size p^, and the temperature ti and 
similarly for (1/*/) (L'J indicates the integer part). Since for chains attached to lower temperatures 
of^l (l/ti) » 6^^^^ the algorithm proposes to update almost all binary indicators with value 1, 

while it selects at random a group of approximately O^^^ (l/ti) {p — pj) binary indicators with value 
to be updated. At higher temperatures since O^^^^ and 9^^^ become more similar, the number of models 
evaluated in a cycle increases because much more binary indicators with value are updated. Full details 
of the FSMH scheme is given in the Appendix A.l, while evaluation of them and comparison with MC^ 
embedded in EMC are presented in Sections 4. 1 and 4.2 
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Global move: crossover operator 

The first step of this move consists of selecting the pair of chains {I, r) to be operated on. We firstly com- 
pute a probability equal to the weight of the "Boltzmann probability", pt {ji |t) = exp {/ (7; \t) /t} /Ft, 
where / (7; |r ) = logp (7; |y, r) + \ogp (7/) is the log transformation of the full conditional (10) as- 
suming = 1 VZ, Z = 1, . . . , L, and Ft = "^j^i exp {/ (7; |t) /t} for some specific temperature t, and 
then rank all the chains according to this. We use normalised Boltzmann weights to increase the chance 
that the two selected chains will give rise, after the crossover, to a new configuration of the population 
with higher posterior probabihty. We refer to this first step as "selection operator". 

Suppose that two new latent binary vectors are then generated from the selected chains according to 
some crossover operator described below. The new proposed population of chains 
7' = (71, . . . , 7^, . . . , 7^, . . . , 7i) is accepted with probability 

a ^7 ^ 7 j mm ^ ^ ^^^^ (7 ^ Y k ) / ' ^ ^ 

where Qt (7 ^ 7' |r ) is the proposal probability, see Liang and Wong (2000). 

In the following we will assume that four different crossover operators are selected at random at every 
EMC sweep: 1-point crossover, uniform crossover, adaptive crossover (Liang and Wong, 2000) and a 
novel block crossover. Of these four moves, the uniform crossover which "shuffles" the binary indicators 
along all the selected chains is expected to have a low acceptance, but to be able to genuinely traverse 
regions of low posterior probability. The block crossover essentially tries to swap a group of variables 
that are highly correlated and can be seen as a multi-points crossover whose crossover points are not 
random but defined from the correlation structure of the covariates. In practice the block crossover is 
defined as follows: one variable is selected at random with probabihty 1/p, then the pairwise correlation 
p (^Xj,Xjr^ between the jth selected predictor and each of the remaining covariates, j' = 1,. . . ,p, 
j' 7^ j, is calculated. We then retain for the block crossover all the covariates with positive (negative) 
pairwise correlation with Xj such that \p (^Xj,Xji) | > po- The threshold po is chosen with consideration 
to the specific problem, but we fixed it at 0.25. Evaluation of block crossover and comparisons with other 
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crossover operators are presented on a real data example in Section 4.1. 
Global move: exchange operator 

The exchange operator can be seen as an extreme case of crossover operator, where the first proposed 
chain receives the whole second chain state 7^ = 7^, and vice versa. In order to achieve a good acceptance 
rate, the exchange operator is usually applied on adjacent chains in the temperature ladder, which hmits 
its capacity for mixing. To obtain better mixing, we implemented two different approaches: the first 
one is based on Jasra et al. (2007) and the related idea of delayed rejection (Green and Mira, 2001); 
the second, a bolder "all-exchange" move, is based on a precalculation of all the L (L — 1) /2 exchange 
acceptance rates between all chains pairs (Calvo, 2005). Full relevant details are presented in Appendix 
A.l. Both of these bold moves perform well in the real data applications, see Section 4.1, and simulated 
examples, see Section 4.2, thus contributing to the efficiency of the algorithm. 

Temperature placement 

As noted by Goswami and Liu (2007), the placement of the temperature ladder is the most important 
ingredient in population based MCMC methods. We propose a procedure for the temperature placement 
which has the advantage of simplicity while preserving good accuracy. First of all, we fix the size L of 
the population. In doing this, we are guided by several considerations: the complexity of the problem, 
i.e. E {p.y), the size of the data and computational limits. We have experimented and we recommend to 
fix L > 3. Even though some of the simulated examples had 2± 20 (Section 4.2), we found that L = 5 
was sufficient to obtain good results. In our real data examples (Section 4.1), we used L = 4 guided by 
some prior knowledge on E (p-y). Secondly, we fix at an initial stage, a temperature ladder according 
to a geometric scale such that ti^i/ti = b, b > 1, I = 1, . . . , L with b relatively large, for instance 
6 = 4. To subsequently tune the temperature ladder, we then adopt a strategy based on monitoring only 
the acceptance rate of the delayed rejection exchange operator towards a target of 0.5. Details of the 
implementation are left in Appendix A. 1 
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3.2 Adaptive Metropolis-within-Gibbs for r 

Various strategies can be used to avoid having to sample from the posterior distribution of the variable 
selection coefficient r. The easiest way is to integrate it out through a Laplace approximation (Tiemey 
and Kadane, 1986) or using a numerical integration such as quadrature on an infinite interval. We do 
not pursue these strategies and the reasons can be summarised as follows. Integrating out r in the 
population imphcitly assumes that every chain has its own value of the variable selection coefficient ti 
(and of the latent binary vector 7;). In this set-up, two unpleasant situations can arise: firstly, if a Laplace 
approximation is applied, equilibrium in the product space is difficult to reach because the posterior 
distribution of 7; depends, through the marginal likelihood obtained using the Laplace approximation, 
on the chain specific value of the posterior mode for r/, f-y, (details in Appendix A.2). Since the strength 
of X-yj to predict the response is weakened for chains attached to high temperatures, it turns out that for 
these chains, f-y, is Ukely to be close to zero. When the variable selection coefficient is very small, the 
marginal hkeUhood dependence on Xj^ decreases even further, see for instance (7), and chains attached 
to high temperatures will experience a very unstable behaviour, making the convergence in the product 
space hard to reach. In addition, if an automatic tuning of temperature ladder is applied, chains will 
increasingly be placed at a closer distance in the temperature ladder to balance the low acceptance rate 
of the global moves, negating the purpose of EMC. 

In this paper the convergence is reached instead in the product space Hi^i \P ill Iv^ )]^^*' P (''')' 
the whole population is conditioned on a value of r common to all chains. This strategy will alleviate 
the problems highhghted before allowing for faster convergence and better mixing among the chains. 
The procedure just described comes with an extra cost, i.e. sampling the value of r. However, this step 
is inexpensive in relation to the cost required to sample 7;, Z = 1, . . . , L. There are several strategies 
that can be used to sample r from (11). We found useful to apply the idea of adaptive MetropoUs- 
within-Gibbs described in Roberts and Rosenthal (2008). Conditions for the asymptotic convergence 
and ergodicity are guaranteed as we enforce the diminishing adaptive condition, i.e. the transition kernel 
stabilises as the number of sweeps goes to infinity and the bounded convergence condition, i.e. the 
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convergence time of the kernel is bounded in probability. In our set-up using an adaptive proposal to 
sample r has several benefits; amongst others it avoids the known problems faced by the Gibbs sampler 
when the prior is proper, but relatively flat (Natarajan and McCuUoch, 1998) as can happen for Z-S 
priors when n is large or for the independent case considered by Bae and Mallick (2004). Moreover, 
given an upper limit on the number of sweeps, the adaptation guarantees a better exploration of the tails 
of p (r |y ) than with a fixed proposal. For details of the implementation and discussion of conditions for 
convergence, see Appendix A.2. 

3.3 ESS algorithm 

In the following, we refer to our proposed algorithm. Evolutionary Stochastic Search as ESS. If ^-priors 
are chosen the algorithm is denoted as ESS^', while we use ESSi if independent priors are selected (the 
same notation is used when r is fixed or given a prior distribution). Without loss of generaUty, we assume 
that the response vector and the design matrix have both been centred and, in the case of independent 
priors, that the design matrix is also rescaled. Based on the two full conditionals (10) and (11) and the 
local and global moves introduced earher, our ESS algorithm can be summarised as follows. 

• Given r, sample the population's states 7 from the two steps: 

(i) With probability 0.5 perform local move and with probability 0.5 apply at random one of the four 
crossover operators: 1-point, uniform, block and adaptive crossover. If local move is selected, use 
FSMH sampUng scheme independently for each chain (see Appendix A.l). Moreover every 100 

sweeps apply on the first chain a complete scan by a Gibbs sampler. 

(ii) Perform the delayed rejection exchange operator or the all-exchange operator with equal probabil- 
ity. During the bum-in, only select the delayed rejection exchange operator. 

• When r is not fixed but has a prior p (r), given the latent binary configuration 7 = (71, . . . , 7^), 
sample r from an adaptive Metropolis-within-Gibbs sampling (Section 3.2). 

From a computational point of view, we used the same fast form for updating S (7) as Brown et al. 
(1998), based on the QR-decomposition. Besides its numerical benefits, QR- decomposition can deal 
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with the case > n. This avoids having to restrict the search to models with < n, and helps mixing 
during the bum-in phase. 

4 Performance of ESS 
4.1 Real data examples 

The first real data example is an appUcation of linear regression to investigate genetic regulation. To 
discover the genetic causes of variation in the expression (i.e. transcription) of genes, gene expression 
data are treated as a quantitative phenotype while genotype data (SNPs) are used as predictors, a type of 
analysis known as expression Quantitative Trait Loci (eQTL). 

Here we focus on the abiUty of ESS to find a parsimonious set of predictors in an animal data set 
(Hubner et al, 2005), where the number of observations, n = 29, is small with respect to the number 
of covariates p = 1,421. This situation, where n ^ p, is quite common in animal experiments since 
environmental sources of variation are controlled as well as the biological diversity of the sample. For 
illustration, we report the analysis of one gene expression response, where we apply ESS^ with and 
without the hyperprior on r, see Table 1- eQTL. In the former case, thanks to the adaptive proposal, the 
Markov chain for r mixes very well reaching an overall acceptance rate which is close to the target value 
0.44. Convergence issue is not a problem since the trace of the proposal's standard deviation stabiUses 
quickly and well inside the bounded conditions, see Figure 3. 

In both cases a good mixing among the L = 4 chains is obtained (Figure 1, top panels, ESS 5 with 
r = 29). Although in the case depicted in Figure 1 with fixed r, the convergence is reached in the product 
space b i^i |y)]^^*'' by visual inspection we see that each chain marginally reaches its equilibrium 
with respect to the others; moreover, thanks to the automatic tuning of the temperature placement during 
the bum-in, the distributions of the chains log posterior probabiUties overlap nicely, allowing effective 
exchange of information between the chains. Table 1-eQTL, confirms that the automatic temperature 
selection works well (with and without the hyperprior on r) reaching an acceptance rate for the monitored 
exchange (delayed rejection) operator close to the selected target of 0.50. The all-exchange operator 
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shows a higher acceptance rate, while, in contrast to Jasra et al. (2007), the overall crossover acceptance 
rate is reasonable high: in our experience the good performance of the crossover operator is both related 
to the selection operator (Section 3.1) and the new block crossover which shows an acceptance rate far 
higher than the others. Finally the computational time on the same desktop computer (see details in 
Appendix B.3) is rather similar with or without the hyperprior r, 28 and 30 minutes respectively for 
25, 000 sweeps with 5, 000 as bum-in. 

The main difference among the two implementations of ESS^ is related to the posterior model size: 
when T is fixed at r = 29 (Unit Information Prior, Fernandez et al., 2001), there is more uncertainty 
and support for larger models, see Figure 2 (a). In both cases we fix E [p^) = 4 and V {p^) = 2, 
following prior biological knowledge on the genetic regulation. The posterior mean of the variable se- 
lection coefficient is a httle smaller than the Unit Information Prior, with ESSg' coupled with the Z-S 
prior favouring smaller models than when r is set equal to 29. The best model visited (and the cor- 
responding = 1 — S{'y)/y'^y) is the same for both version of ESS^f, while, when a hyperprior on 
r is implemented, the "stability index" which indicates how much the algorithm persists on the first 
chain top 1, 000 (not unique) visited models ranked by the posterior probability (Appendix B.3), shows a 
higher stabiUty, see Table 1- eQTL. In this case, having a data-driven level of shrinkage helps the search 
algorithm to better discriminate among competing models. 

Our second example is related to the application of model (1) in another genomics example: 10, 000 
SNPs, selected genome-wide from a candidate gene study, are used to predict the variation of Mass 
Spectography metabolomics data in a small human population, an example of a so-called mQTL exper- 
iment. A suitable dimension reduction of the data is performed to divide the spectra in regions or bins 
and log 10 -transformation is applied in order to normalise the signal. 

We present the key findings related to a particular metabolite bin, but the same comments can be 
extended to the analysis of the whole data set, where we regressed every metaboUtes bin versus the 
genotype data (n = 50 and p = 10, 000). In this very challenging case, we still found an efficient mixing 
of the chains (see Table 1-mQTL). Note that in this case the posterior mean of r, 63.577, is a little 
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larger than the Unit Information Prior, t = n, although the influence of the hyperprior is less important 
than in the previous real data example, see Figure 2 (b). In both examples, the posterior model size 
favours clearly polygenic control with significant support for up to four genetic control points (Figure 
2) highlighting the advantage of performing multivariate analysis in genomics rather than the traditional 
univariate analysis. 

As expected in view of the very large number of predictors, in the mQTL example the computational 
time is quite large, around 5 hours for 20, 000 sweeps after a bum-in of 5, 000, but as shown in Table 1 
by the "stability index" (Ri 0), we believe that the number of iterations chosen exceeds what is required 
in order to visit faithfully the model space. For such large data analysis tasks, parallelisation of the code 
could provide big gains of computer time and would be ideally suited to our multiple chains approach. 

[Table 1 about here - Figure 1 about here - Figure 2 about here - Figure 3 about here] 

We also evaluate the superiority of our ESS algorithm, and in particular the FSMH scheme and the 
block crossover, with respect to more traditional EMC implementations illustrated for instance in Liang 
and Wong (2000). Albeit we believe that using a wide portfoho of different moves enables any searching 
algorithm to better explore complicated model spaces, we reanalysed the first real data example, eQTL 
analysis, comparing: (i) ESSg with only FSMH as local move vs ESS with only MC^ as local move; (ii) 
ESS^ with only block crossover vs ESSg with only 1-point, only uniform and only adaptive crossover 
respectively. To avoid dependency of the results on the initiaUsation of the algorithm, we rephcated the 
analysis 25 times. Moreover, to make the comparison fair, in experiment (i) we run the two versions 
of ESSg for a different number of sweeps (25, 000 and 350, 000 with 5, 000 and 70, 000 as bum-in 
respectively), but matching the number of models evaluated. Results are presented in Table 2. We report 
here the main findings: 

(i) over the 25 mns, ESSg^ with FSMH reaches the same top visited model 68% (17/25) while ESSg 
with MC^ the same top model only 28%, with a fixed r, and 88% and 40% respectively with Z-S 
prior. This ability is extended to the top models ranked by the posterior probability, data not shown, 
providing indirect evidence that the proposed new move helps the algorithm to increase its predictive 
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power. The great superiority when FSMH scheme are implemented can be explained by comparing 
subplot (a) and (c) in Figure 1 : the exchange of information between chains for ESSg with MC^ as 
local move when p > n (and p S> Pj) is rather poor, negating the purpose of EMC. ESSg with MC^ 
has more difficulties to reach convergence in the product space and, in contrast to ESS^ with FSMH, 
the retained chain does not easily escape from local modes. This later point can be seen looking at 
Figure 1 (d) which magnifies the right hand tail of the kernel density of logp (7 |y ) for the recorded 
chain, pulling together the 25 runs: interestingly ESSg with FSMH is less "bumpy", showing a better 
ability to escape from local modes and to explore more efficiently the right tail, 
(ii) Regarding the second comparison when r is fixed, ESSg with only block crossover beats constantly 
the other crossover operators, with 80% vs about 60%, in terms of best model visited (Table 2) and 
models with higher posterior probability (data not shown), has higher acceptance rate (Table 3), show- 
ing also a great capacity to accumulate posterior mass as illustrated in Figure 4. The specific benefit 
of the block crossover is less pronounced when a prior on r is specified, but we have already noticed 
that in this case having a hyperprior on r greatly improves the efficiency of the search. 

[Table 2 about here - Table 3 about here - Figure 4 about here] 

4.2 Simulation study 

We briefly report on a comprehensive study of the performance of ESS in a variety of simulated examples 
as well as a comparison with SSS. To make comparison with SSS fair, we use ESSi, the version of our 
algorithm which assumes independent priors, E-y = r/p^,with r fixed at 1. Details of the simulated 
examples (6 set-ups) and how we conducted the simulation experiment (25 replication of each set-up) 
are given in Appendix B. The rationale behind the construction of the examples was to benchmark our 
algorithm against both n > p and p > n cases, to use as building blocks intricate correlation structures 
that had been used in previous comparisons by G&McC (1993, 1997) and N&G (2004), as well as a 
reahstic correlation structure derived from genetic data, and to include elements of model uncertainty in 
some of the examples by using a range of values of regression coefficients. 
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In our example we observe an effective exchange of information between the chains (reported in Ta- 
ble 4) which shows good overall acceptance rates for the collection of moves that we have implemented. 
The dimension of the problem does not seem to affect the acceptance rates in Table 4, remarkably since 
values of p range from 60 to 1, 000 between the examples. We also studied specifically the performance 
of the global moves (Table 5) to scrutinise our temperature tuning and confirmed the good performance 
of ESSz with good frequencies of swapping (not far from the case where adjacent chains are selected to 
swap at random with equal probability) and good measures of overlap between chains. 

All the examples were run in parallel with ESSi and SSS 2.0 (Hans et al, 2007) for the same number 
of sweeps (22,000) and matching hyperparameters on the model size. Comparison were made with 
respect to the marginal probabiUty of inclusion as well as the ability to reach models with high posterior 
probabihty and to persist in this region. For a detailed discussion of all comparison, see Appendix B.3. 

Overall the covariates with non-zero effects have high marginal posterior probabihty of inclusion for 
ESSz in all the examples, see Figure 6. There is good agreement between the two algorithms in general, 
with additional evidence on some examples (Figure 6 (c) and (d)) that ESSi is able to explore more fully 
the model space and in particular to find small effects, leading to a posterior model size that is close to 
the true one. Measures of goodness of fit and stability. Table 6, are in good agreement between ESSi and 
SSS. The comparison highhght that a key feature of SSS, its ability to move quickly towards the right 
model and to persist on it, is accompanied by a drawback in having difficulty to explore far apart models 
with competing explanatory power, in contrast to ESSz (contaminated example set-up). Altogether ESSz 
shows a small improvement of R^, related to its ability to pick up some of the small effects that are 
missed by SSS. Finally ESSz shows a remarkable superiority in terms of computational time, especially 
when the simulated (and estimated) is large. Altogether our comparisons show that we have designed 
a fully Bayesian MCMC-EMC sampler which is competitive with the effective search provided by SSSi 

In the same spirit of the real data example analysis, we also evaluate the superiority of the FSMH 
scheme with respect to more traditional EMC implementations, i.e when a MC^ local move is selected. 
While both versions of the search algorithm visit almost the same top models ranked by the posterior 
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probability, ESS persists more on the top models. 



[Table 4 about here - Table 5 about here - Table 6 about here 
Figure 5 about here - Figure 6 about here] 

5 Discussion 

The key idea in constructing an effective MCMC sampler for 7 and r is to add an extra parameter, the 
temperature, that weakens the likelihood contribution and enables escaping from local modes. Running 
parallel chains at different temperature is, on the other hand, expensive and the added computational cost 
has to be balanced against the gains arising from the various "exchanges" between the chains. This is 
why we focussed on developing a good strategy for selecting the pairs of chains, using both marginal 
and joint information between the chains, attempting bold and more conservative exchanges. Combining 
this with an automatic choice of the temperature ladder during bum-in is one of the key element of our 
ESS algorithm. Using PT in this way has the potential to be effective in a wide range of situations where 
the posterior space is multimodal. 

To tackle the case where p is large with respect to p^, the second important element in our algo- 
rithm is the use of a MetropoUsed Gibbs sampUng-Uke step performed on a subset of indices in the 
local updating of the latent binary vector, rather than an MC^ or RJ-hke updating move. The new Fast 
Scan Metropolis Hastings sampler that we propose to perform these local moves achieves an effective 
compromise between full Gibbs sampling that is not feasible at every sweep when p is large and vanilla 
add/delete moves. Comparison of FSMH vs MC^ scheme on a real data example and simulation study 
shows the superiority of our new local move. 

When a model with a prior on the variable selection coefficient r is preferred, the updating of r 
itself present no particular difficulties and is computationally inexpensive. Moreover, using an adaptive 
sampler makes the algorithm self contained without any time consuming tuning of the proposal variance. 
This latter strategy works perfectly well both in the g-prior and independent prior case as illustrated in 
Sections 4. 1 and 4.2. Our current implementation does not make use of the output of the heated chains 
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for posterior inference. Whether gains in variance reduction could be achieved in the spirit of Gramacy 
et al. (2007) is an area for further exploration, which is beyond the scope of the present work. 

Our approach has been apphed so far to linear regression with univariate response y. An interesting 
generaUsation is that of a multidimensional n x q response Y and the identification of regressors that 
jointly predict the Y (Brown et al., 1998). Much of our set-up and algorithm carries through without 
difficulties and we have already implemented our algorithm in this framework in a challenging case 
study in genomics with multidimensional outcomes. 
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Appendix 

A Technical details of EMC implementation 

In this Section we will describe some technical details omitted from the paper and related to the samphng 
schemes we used for the population of binary latent vectors 7 and the selection coefficient r. 

A.l EMC sampler for 7 

Local move: FSMH scheme 

Let 7;^j, I = 1, . . . , L and j = 1, ... ,p to denote the jth latent binary indicator in the l\h chain. As in 
Kohn et al. (2001), let 7^^.^ = (7,1, . . . , 7j-i, 7j = 1, 7j+i, • • • , ll,pf and 

= • • • ' 11,3-1^11,3 = 0, li,j+i, li,pY ■ Furthermore let if^- (xp(y iIj,t^ and oc 
p (^y lif and finally O^^j = p ['jij = 1 \ ) and of^j = 1 — oj^j. From (6) it is easy to prove that 

21 



where p-y^ is the current model size for the Ith chain. Using the above equation, for 7; j = 1 the normalised 
version of (12) can be written as 

^(1) ^(1) m 

[P ill, = 1 \y,7u-,r)Y^'' = , (A.2) 



I) 

where 5(1^0 = + 4°)'^*' lJ J with [p{-fij = 1 \y,lij-,r)Y^" defined simi- 

larly. Hence if Oi^j ^ ' is very small, then [p (yij = l\y, jij- , r)] is small as well. Therefore for 
the Gibbs sampler with a beta-binomial prior on the model space, the posterior probability of ^ij = 1 
depends crucially on 6} ■ 

''1J 

In the following we derive a Fast Scan Metropolis-Hastings scheme speciahsed for Evolutionary 
Monte Carlo or parallel tempering. We define Q{1^ Q) = Q (^^^^ j^) proposal probability 
to go from 1 to and Q (0 ^ 1) the proposal probability to go from to 1 for the jth variable and Zth 
chain. Moreover using the notation introduced before, the Metropolis-within-Gibbs version of (12) to go 
from to 1 in the EMC local move is 

(0^1)= min 1, , ^ ^ ^ ° (A.3) 

' ^ > 1 ' (o)iA< (o)iAQ(O^l) f 

with a similar expression for aj^™*^ (1 — >■ 0). The proof of the Propositions are omitted since they are 
easy to check. We first introduce the following Proposition which is useful for the calculation of the 
acceptance probability in the FSMH scheme. 

(0) ^/*' / (1) 

Proposition 1 The following three conditions are equivalent: a) L) ■ / L) ■ > 1 ; 
h) /S (1 A) > 1 ; c)Lf] /S (l/ti) < 1 , where S {1/ti) = S (l/U) j (^^^ + df] 

is tiie convex combination of the marginal likelihood Lj^J and Lj^J with weights 6^ J {^/U) = 
/ ( ^ + ,(») an, (l/y = 1 - (l/y. 



The FSMH scheme can be seen as a random scan Metropolis-within-Gibbs algorithm where the num- 
ber of evaluations is hnked to the prior/current model size and the temperature attached to the chain. The 
computation requirement for the additional acceptance/rejection step is very modest since the normahsed 
tempered version of (A.l) is used. 
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Proposition 2 Letl = 1, . . . ,L, j = I, . . . ,p (or any permutation of them), Q^^^^ (0^1) = i9[^.^ [l/ti] 
andQP^^^{l ^ 0) = ^J°^ {1/ti) with 9^^^ (l/ti) = 1 - 9^^^ {1/ti). The acceptance probabihties are 



^t^MH (0 ^ 1) = <^ 



,FSMH 



(1 ^ 0) = <^ 



(A.4) 



(A.5) 



The above sampling scheme works as follows. Given the Zth chain, if ^ij = (and similarly for ^ij = 1), 
it proposes the new value from a Bemoulh distribution with probability 9i^j {1/ti): if the proposed value 
is different from the current one, it evaluates (A.4) (and similarly A.5)otherwise it selects a new covariate. 

Finally it can be proved that the Gibbs sampler is more efficient than the FSMH scheme, i.e. for 
a fixed number of iterations, Gibbs sampUng MCMC standard error is lower than for FSMH sampler. 
However the Gibbs sampler is computationally more expensive so that, if p is very large, as described in 
Kohn et al. (2001), FSMH scheme becomes more efficient per floating point operation. 

Global move: exchange operator 

The exchange operator can be seen as an extreme case of crossover operator, where the first proposed 
chain receives the whole second chain state 7^ = 7^, and the second proposed chain receives the whole 
first state chain 7^ = 7;, respectively. 

In order to achieve a good acceptance rate, the exchange operator is usually appUed on adjacent 
chains in the temperature ladder, which limits its capacity for mixing. To obtain better mixing, we 
implemented two different approaches: the first one is based on Jasra et al. (2007) and the related idea of 
delayed rejection (Green and Mira, 2001); the second one on Gibbs distribution over all possible chains 
pairs (Calvo, 2005). 

1. The delayed rejection exchange operator tries first to swap the state of the chains that are usually 
far apart in the temperature ladder, but, once the proposed move has been rejected, it performs a 
more traditional (uniform) adjacent pair selection, increasing the overall mixing between chains on 
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one hand without drastically reducing the acceptance rate on the other. However its flexibihty comes 
at some extra computational costs and in particular the additional evaluation of the pseudo move 
necessary to maintain detailed balance (Green and Mira, 2001). Details are reported below. 

Suppose two chains are selected at random, I and r with Z / r, in order to swap their binary latent 
vector. Then, given that 7^' = 7r-> 7r = 7; and Qt{l ^ l') = <5t (7' ^ 7)> (13) reduces to 

/ ^ A ■ /1 cxp{/(7r|T)/ti + /(7i|r)/tJ\ 

ai 7 ^ 7 = mm < 1, — — — t—^. — — t-^—. — r- > . 

^' I exp{/(7;|r)/t, + /(7.|r)/t,}/ 

Since the two chains are selected at random, the above acceptance probability decreases exponentially 

with the difference | l/t; — l/t^ | and therefore most of the proposed moves are rejected. If rejected, a 

delayed rejection-type move is applied between two random adjacent chains, with I the first one and 

s, \l — s\ = 1, the second one, giving rise to the new acceptance probability 

I exp{/(7/|r)/t; + /(7s|T)/tJ 1 - ai (7 ^ 7 ) J 
where the pseudo move 7* is necessary in order to maintain the detailed balance condition (Green 

and Mira, 2001). 

2. Alternatively, we attempt a bolder "all-exchange" operator. Swapping the state of two chains that are 
far apart in the temperature ladder speeds up the convergence of the simulation since it replaces several 
adjacent swaps with a single move. However, this move can be seen as a rare event whose acceptance 
probabiUty is low and unknown. Since the full set of possible exchange moves is finite and discrete, it 
is easy and computationally inexpensive to calculate all the L (L — 1) /2 exchange acceptance rates 
between all chains pairs, inclusive the rare ones, pi^^ = exp {(/ (7^ |t) — / (7; |t)) {1/ti — l/U)}. 
To maintain detailed balance condition, the possibility not to perform any exchange (rejection) must 
be added with unnormalised probability one. Finally the chains whose states are swopped are selected 
at random with probability equal to 

Z^h=l 

where in (A.6) each pair (Z, r < /) is denoted by a single number h, ph = pi^r^ including the rejection 
move, h = 1. 
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Temperature placement 

First we select the number L of chains close to the complexity of the problem, i.e. E (p-y), although the 
size of the data and computational limits need to be taken into account. Secondly, we fix a first stage 
temperature ladder according to a geometric scale such that U+i/ti = b, b > 1, I = 1, . . . , L with b 
relatively large, for instance 6 = 4. Finally, we adopt a strategy similar to the one described in Roberts 
and Rosenthal (2008), but restricted to the bum-in stage, monitoring only the acceptance rate of the 
delayed rejection exchange operator. After the kth "batch" of EMC sweeps, to be chosen but usually set 
equal to 100, we update 6^, the value of the constant b up to the A;th batch, by adding or subtracting an 
amount db such that the acceptance rate of the delayed rejection exchange operator is as close as possible 
to 0.50 (Liu, 2001; Jasra et al, 2007), 6^+1 = 2'°g2'';=±56. Specifically the value of 5^ is chosen such 
that at the end of the bum-in period the value of b can be 1. To be precise, we fix the value of (5^ as 
log2 /K, where b\ is the first value assigned to the geometric ratio and K is the total number of 
batches in the bum-in period. 

A.2 Adaptive Metropolis-within-Gibbs for r 

Laplace approximation for the conditional marginal likelihood 

Under model (1) and prior specification for a, (2) and (3), we provide the Laplace approximation of 
p (y |7, r ) for the ^-prior case, while the approximation for the independent case can be derived following 
the same line of reasoning. For easy of notation we drop the chain subscript index and we assume that 
the observed responses y have been centred with mean 0, i.e. {y — yn) = y. In the following we 
will distinguish the cases in which the posterior mode is a solution of a cubic or quadratic equation. 
Conditions on the existence of the solutions are provided as well as those that guarantee the positive 
semidefiniteness of the variance approximation. Recall that 
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where A is the posterior mode after the transformation A = log (r), which is necessary to avoid problems 
on the boundary, a-^ is the approximate squared root of the variance calculated in A and J (•) is the 
Jacobian of the transformation. Details about Laplace approximation can be found in Tiemey and Kadane 
(1986). Similar derivations when p (cr^) oc a~'^ are presented in Liang et al. (2008). Finally throughout 
the presentation we will assume that n> and that Ug and bg wee fixed small as in Kohn et al. (2001). 
Cubic equation for Zellner-Siow priors 

If p (r) = InvGa [ut, h-r) the posterior A mode is the only positive root of the integrand function 



-(2aa+n-l)/2 g-6r/e^ 



where the last factor in the above equation = |cie'*'/ciA| is the Jacobian of the transformation. After 
the calculus of the first derivative of the log transformation and some algebra manipulations, it can be 
shown that is the solution of the cubic equation 

3A I - C2C4 - (C3 + C4) a-r + Cjhr 2A , -C^jCLr + (cs + C4) K x, £36^ _ n 7^ 

(Ci — C2 — ttr) C4 [Ci — C2 — ttr) C4 [Ci — C2 — ar) C4 

and that 



^ (logp(y |7,A) + logp(A))" 



A=A 

A „ ...A -1 



e C3C4e br 

-Cl —5- + C2 — :t H r- 

(1 + e^y (C3 + C4e^)^ 



(A.8) 

A=A 

where Cl = {2aa + n — 1 — p^) /2, C2 = {2aa + n— 1) /2, C3 = 2ba + y^y and C4 = 2ba + 



y^y (1 — i?^). Following Liang al. (2008), since limx^-oodlx/dX > 0, because c^br > 0, and 
lim;^_^oo dIx/dX < 0, because (ci — C2 — a,-) C4 < 0, at least one real positive solution exists. More- 
over since — (036^) / (ci — C2 — a^) C4 > 0, the remaining two real solutions should have the same sign 
(Abramowitz and Stegun, 1970). A necessary condition for the existence of just one real positive solution 
is that the summation of all the pairs-products of the coefficients is negative 

-Csar + (C3 -I- C4) br ^ ^ 
(Ci - C2 - ar) C4 

and this happens if br/ar > C3/ (03 -|- C4). When ^ and thus C3 = C4, the above condition 
corresponds to br > ar/2 and when i?^ 1, as C3/ (C3 -I- C4) « 1 especially when y^y is large, which 
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might be expected when n becomes large, the condition is equivalent to h^- > a^-. Therefore it turns out 
that a sufficient condition for the existence of just one real positive solution in (A. 1) is b-r > a-r- 

The positive semidefiniteness of the approximate variance can be proved as follows. First of all it 
is worth noticing that all the terms in (A.8) are of the same order Op (e"^) . Then, when — > 0, the 
positive semidefiniteness is always guaranteed, while when i?^ 1, provided that y^y is large, the 
middle term in (A.8) tends to zero and the condition is fulfilled if 6^ > ci- 
Quadratic equation for Liang et al. (2008) prior 

If p (r) oc (1 + t)~'^'^, with cv > 0, is only the positive root of the integrand function 

-(2aa+n-l)/2 



= (l + e^) ^---^/^ {26. (l + e^) + y-y [l + e^l - i?^)] }" 

or, after the first derivative of the log transformation, the solution of the quadratic equation 

{cl -C2 + 1) C^e^^ + (CJC3 - C2C4 + C3 + C4) 6^ + C3 = (A.9) 

with c| = [20(j + n — 1 — (p-y + 2ct-)] /2 and C2, C3 and C4 defined as above. The discriminant of the 
quadratic equation is A = (c*C3 — C2C4C3 + C3 + 04)^ — 4 (c* — C2 + 1) C4C3 which is always greater 
than zero and therefore two real roots exist. Since one of them is positive in order to prove that (A.9) 
admits just one positive solution, it is necessary to show that 

- (C|C3 - C2C1 + C3 + C4) - AV^ ^ ^ 
2 {c\ - C2 + 1) C4 

which is true provided that (c^ — C2 + 1) C4C3 < 0. Moreover the approximate variance can be written 



as 



-Cl7^^;2 + ■ 772 



(A.IO) 



(1 + e>^y (C3 + C4e^) 
which is positive semidefinite when — > if C2 > c^, which is always verified, while, if — > 1 and 
y^y is large, equation (A.IO) is not positive unless + 2cr > 20. + n — 1. 
The expUcit solution of the posterior mode is also available 

f (C4 - C3) / (C* - C2) ^ ^ 

max { , \ — 1,0 
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'^^'^ \ [2h^/ {yTy) + {1-R2)]/ [2a^ + „ _ i _ (^^ + 2c^)] 

which corresponds to MLE if Cr = 0. 

Diminishing adaptive and bounded conditions 

Since r is defined on the real positive axis we propose the new value of r on the logarithm scale. In par- 
ticular we use as proposal the normal distribution centred at the current value of log (r) in the ^-prior and 
independent prior case. The variance of the proposal distribution is controlled as illustrated in Roberts 
and Rosenthal (2008): every 100 EMC sweeps, the same value of sweeps used in the temperature place- 
ment, we monitor the acceptance rate of the Metropolis-within-Gibbs algorithm: if it is lower (higher) 
than the optimal acceptance rate, i.e. 0.44, a constant dr{k) is added (subtracted) to Isk, the log standard 
deviation of the proposal distribution in the kth batch of EMC sweeps. The value of the constant to be 
added or subtracted is rather arbitrary, but we found useful to fix it as |Zsi — 5| /^, where K is the total 
number of batches in the bum-in period: during the burn-in the log standard deviation should be able 
to reach any values at a distance ±5 in log scale from the initial value of Isi usually set equal to zero. 
The diminishing adaptive condition is obtained imposing 5^ (k) = min {\lsi - 5| /K,k-'^/^}, where k 
is the current number of batches, including the bum-in. To ensure the bounded convergence condition we 
follow Roberts and Rosenthal (2008), restricting each Is^ to be inside [Mi, M2] and we fix them equal 
to Ml = — 10 and M2 = 10 respectively. In practise these bounds do not create any restriction since the 
sequence of the standard deviations of the proposal distribution stabiUses almost immediately, indicating 
that the transition kemel converges in a bounded number of batches, see Figure 2. 

B Performance of ESS: Simulation study 

In this Section we report in details on the performance of ESS in a variety of simulated examples. Main 
conclusions are summarised in the Section 4.2. 

Firstly we analyse the simulated examples with ESSi the version of our algorithm which assumes 
independent priors, S-y = rlp^, so as to enable comparisons with SSS which also implements an inde- 
pendent prior. Moreover, in order to make to comparison with SSS fair, in the simulation study only the 
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(A. 11) 



first step of the algorithm described in Section 3.3 is performed, with r fixed at 1. As in SSS, standard- 
isation of the covariates is done before running ESSi We run ESSi and SSS 2.0 (Hans et al, 2007) for 
the same number of sweeps (22,000) and with matching hyperparameters on the model size. 

Secondly, to discuss the mixing properties of ESS when a prior p (r) is defined on r, we implement 
both the 5-prior and independent prior set-up for a particular simulated experiment. To be precise in the 
former case we will use the Zellner-Siow priors (8), and for the latter we will specify a proper but diffuse 
exponential distribution as suggested by Bae and Mallick (2004). 

B.l Simulated experiments 

We apply ESS with independent priors to an extensive and challenging range of simulated examples 
with r fixed at 1: the first three examples (Exl-Ex3) consider the case n > p while the remaining 
three (Ex4-Ex6) have p > n. Moreover in all examples, except the last one, we simulate the design 
matrix, creating more and more intricated correlation structures between the covariates in order to test 
the proposed algorithm in different and increasingly more realistic scenarios. In the last example, we use, 
as design matrix, a genetic region spanning 500-kb from the HapMap project (Altshuler et al., 2005). 

Simulated experiments Exl-Ex5 share in common the way we build X. In order to create moderate 
to strong correlation, we found useful referring to two simulated examples in George and McCuUoch, 
G&McC hereafter, (1993) and in G&McC (1997): throughout we call Xi in x 60) and X2 (n x 15) 
the design matrix obtained from these two examples. In particular the j\h column of Xi, indicated 
as is simulated as = X* + Z, where XI,. . . ,XgQ iid ^ Nn (0, 1) independently form 

Z ^ Nn (0, 1), inducing a pairwise correlation of 0.5. X2 is generated as follows: firstly we simulated 
Zi, . . . , Zi5 iid ~ Nn (0, 1) and we set X(2)j = IZj for j = 1, 3, 5, 8, 9, 10, 12, 13, 14, 15 only. To 
induce strong multicolhnearity, we then set X(2)2 = -^^(2)1 + 0.15^2, -^^(2)4 = -^^(2)3 + O.I5Z4, X(<i)% = 

^(2)5 + 0.15^6, X(2)7 = X(2)8 -|- ^(2)9 - ^(2)10 + O.lSZy and X(2)ii = ^(2)14 + -^^(2)15 " -^^(2)12 " 

X(2)i3 -I- O.I5Z11. A pairwise correlation of about 0.998 between X^2)j and X^2)j+i for j = 1, 3, 5 is 
introduced and similarly strong Unear relationship is present within the sets (-^(2)7) -^(2)8; ^(2)9; -^^(2)10) 
and (^(2)11, -^^(2)12! -^^(2)13) -'^(2)14! ^(2)15) • 
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Then, as in Nott and Green, N&G hereafter, (2004) Example 2, more complex structures are created 
by placing side by side combinations of Xi and/or X2, with different sample size. We will vary the 
number of samples n in Xi and X2 as we construct our examples. The levels of (5 wee taken from the 
simulation study of Fernandez et al. (2001), while the number of true effects, p-y, with the exception of 
Ex3, varies from 5 to 16. Finally the simulated error variance ranges from 0.05^ to 2.5^ in order to vary 
the level of difficulty for the search algorithm. Throughout we only Ust the non-zero /3-y and assume that 
= 0^. The six examples can be summarised as follows: 

Exl: X = Xi is a matrix of dimension 120 x 60, where the responses are simulated from (1) using a = 0, 
7 = (21,37,46,53,54)^, = (2.5, 0.5, -1, 1.5, 0.5)^, and e - N (O, 22/120). In the following we 
will not refer to the intercept a any more since, as described in Section 3.3 in the paper, we consider 
y centred and hence there is no difference in the results if the intercept is simulated or not. This is the 
simplest of our example, although, as reported in G&McC (1993) the average pairwise correlation is 
about 0.5, making it already hard to analyse by standard stepwise methods. 

Ex2: This example is taken directly from N&G (2004), Example 2, who first introduce the idea of combining 



simpler "building blocks" to create a new matrix X : in their example X = 



^(1)^(2) 



is a 300 X 



30 matrix, where Xg^^ and X2^ are of dimension 300 x 15 and have each the same structure as 
X2. Moreover 7 = (1,3,5,7,8,11,12,13)^, (3^ = (1.5,1.5,1.5,1.5,-1.5,1.5,1.5,1.5)'^ and £ ~ 
N (0, 2.5^/300)- We chose this example for two reasons: firstly, since the correlation structure in X2 
is very involved, we test the proposed algorithm under strong and complicated correlations between 
the covariates; secondly, since y is not simulated from the second "block", we are interested to see if 
the proposed algorithm does not select any variable that belongs to the second group. 
Ex3: As in G&McC (1993), Example 2,X = Xi, is a 120x60 matrix, ^ = (^Si, . . . , ^qq)^ , {Pi,..., P15) = 
(0, ... ,0), (/3i6, . . . ,/33o) = (1, • • • , 1), {P3i, ■ ■ ■ ,/345) = (2, ... ,2), (/346, . . . ,/36o) = (3, . . . ,3) and 
e N (0, 2^/120) • The motivation behind this example is to test the strength of the proposed algo- 
rithm to select a subset of variables which is large with respect to p while preserving the abiUty not to 
choose any of the first 15 variables. 
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Ex4: The design matrix X, 120 x 300, is constructed as follows: firstly we create anew 120 x 60 "building 
block", Xs, combining X2 and a smaller version of Xi, X*, a 120 x 45 matrix simulated as Xi, 
such that X3 = [X2Xj^] (dimension 120 x 60). Secondly we place side by side five copies of X3, 



X = 



yW y(2) v^(3) 3^(4) ^(5) 
^3 ^3 ^3 ^3 ^3 



: the new design matrix alternates blocks of covariates of high and 
complicated correlation, as in G&McC (1997), with regions where the correlation is moderate as in 
G&McC (1993). We simulate the response selecting 16 variables from X, 

7 = (1, 11, 30, 45, 61, 71, 90, 105, 121, 131, 150, 165, 181, 191, 210, 225)^ such that every pair be- 
longs alternatively to X2 or Xi. We simulate y using 

= (2,-1,1.5,1,0.5,2,-1,1.5,1,0.5,2,-1,-1,1.5,1,0.5)^ with e ~ N (0,2.52/120). This ex- 
ample is challenging in view of the correlation structure, the number of covariates p > n and the 
different levels of the effects. 



Ex5: This is the most challenging example that we simulated and it is based on the idea of contami 
nated models. The matrix X, 200 x 1000, is X = \x^^^xPxPxi*X^'^^X^^^X^^^xPx^^ 
with X^*, a 200 x 520 larger version of Xi. We partitioned the responses such that y = [yiy2]^- 



yi is simulated from "model 1" (7^ = (701,730,745,763,790,805,825,850,865,887) and - 



7 

(2, -1, 1.5, 1, 0.5, 2, -1, 1.5, 2, -1)) while y2 is simulated from "model 2" (7^ = (1, 38, 63, 98, 125) 
and = (2, -1, 1.5, 1, 0.5)). Finally, fixing s ^ N {O, O.O52/200) and the sample size in the two 
models such that yi and y2 are vectors of dimension 1 x 160 and 1 x 40 respectively, y is retained if, 
given the samphng variability, we find R^i > 0.6 and R'^i/S < R^2 < -R^i/10: in this way we know 
that "model 1" accounts for most of the variability of y, but without a neghgible effect for "model 2". 
In this example, we measure the ability of the proposed algorithm to recognise the most promising 
model and therefore being robust to contaminations. However since ESS can easily jump between 
local modes we are also interested to see if "model 2" is selected. 

Ex6: The last simulated example is based on phased genotype data from HapMap project (Altshuler et 
ai, 2005), region ENm014, Yoruba population: the data set originally contained 1,218 SNPs (Single 
Nucleotide Polymorphism) for 120 chromosomes, but after eUminating redundant variables, the design 
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matrix reduced to 120 x 775. While in the previous examples a "block structure" of correlated variables 
is artificially constructed, in this example blocks of Unkage disequilibrium (LD) derive naturally from 
genetic forces, with a slow decay of the level of pairwise correlation between SNPs. Finally we chose 
7 = (50, 75, 140, 200, 300, 400, 500, 650, 700, 770)^ such that the effects are visually inside blocks of 
LD, with their size simulated from (3^ ^ N (0, 3^/io) with £ ~ A'" (0, 0.10^Ii2o) • Since the simulated 
effects can range roughly between (—6, 6), this will allow us to test also the ability of ESSz to select 
small effects. 

We conclude this Section by reporting how we conducted the simulation experiment: every exam- 
ple from Exl to Ex6 has been replicated 25 times and the results presented for example Exl to Ex5 
are averaged over the 25 replicates. For Ex6 the effects size change so average across replicated is 
only done for the mixing properties. ESSz with r =1 was applied to each example/sample, record- 
ing the visited sequence of 71 for 20, 000 sweeps after a burn-in of 2, 000 required for the automatic 
tuning of the temperature placement. Section 3.1 With the exception of Ex2 and Ex3, where we used 
an indifferent prior, p (7) = (1/2)^, we analysed the remaining examples setting E (p^) = 5 with 
^ (p-y) = E (Pj) — E (p-y) /p) which corresponds to a binomial prior over pj. In order to estabUsh 
the sensitivity of the proposed algorithm to the choice of E (p^) we also analysed Exl fixing E [p^) = 10 
and 20. Moreover in all the examples we chose L = 5 with the starting value of 7 chosen at random. 
The remaining two hyperparameters to be fixed, namely aa- and b^, are set equal to a^- = 10~^ and 
ba = 10~^ as in Kohn et al. (2001) which corresponds to a relative uninformative prior. 

B.2 Mixing properties of ESSi 

In this Section we report some stylised facts about the performance of the ESSz with r fixed at 1. Figure 
5, top panels, shows for one of the rephcates of Exl, the overall mixing properties of ESSi. As expected, 
the chains attached to higher temperatures shows more variability. Albeit the convergence is reached 
in the product space H/^i \P ill 1?/)]^^*'' ^y visual inspection each chain marginally reaches its equilib- 
rium with respect to the others; moreover, thanks to the automatic tuning of the temperature placement 
during the bum-in, the distributions of their log posterior probabiUties overlap nicely, allowing effective 
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exchange of information between the chains. Figure 5, bottom panels, shows the trace plot of the log 
posterior and the model size for a replicate of Ex4. We can see that also in the case p > n, the chains 
mix and overlap well with no gaps between them, the automatic tuning of the temperature ladder being 
able to improve drastically the performance of the algorithm. 

This effective exchange of information is demonstrated in Table 4 which shows good overall accep- 
tance rates for the collection of moves that we have implemented. The dimension of the problem does 
not seem to affect the acceptance rate of the (delayed rejection) exchange operator which stays very sta- 
ble and close to the target: for instance in Ex4 (p = 300) and Ex6 (p = 775) the mean and standard 
deviation of the acceptance rate are 0.517 (0.105) and 0.497 (0.072) while in Ex5 (p = 1, 000) we have 
0.505 (0.013): the higher variabiUty in Ex4 being related to the model size pj. 

With regards to the crossover operators, again we observe stabiUty across all the examples. Moreover, 
in contrast to Jasra et al. (2007), when p > n, the crossover average acceptance rate across the five chains 
is quite stable between 0.147, Ex4, and 0.193, Ex6 (with the lower value in Ex4 here again due to p^): 
within our limited experiments, we believe that the good performance of crossover operator is related to 
the selection operator and the new block crossover, see Section 3.1. 

Some finer tuning of the temperature ladder could still be performed as there seems to be an indication 
that fewer global moves are accepted with the higher temperature chain, see Table 5, where swapping 
probabilities for each chain are indicated. Note that the observed frequency of successful swaps is not 
far from the case where adjacent chains are selected to swap at random with equal probability. Other 
measures of overlapping between chains (Liang and Wong, 2000; Iba 2001), based on a suitable index 
of variation of / (7) = logp (y I7) + logp (7) across sweeps, confirm the good performance of ESSi. 
Again some instability is present in the high temperature chains, see in Table 5 the overlapping index 
between chains 3, 4 and 4, 5 in Example 3 to 6. 

In Ex 1, we also investigate the influence of different values of the prior mean of the model size. We 
found that the average (standard deviation in brackets) acceptance rate across replicates for the delayed 
rejection exchange operator ranges from 0.493 (0.043) to 0.500 (0.040) for different values of the prior 
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mean on the model size, while the acceptance rate for the crossover operator ranges from 0.249 (0.021) 
to 0.271 (0.036). This strong stabiUty is not surprising because the automatic tuning modifies the tem- 
perature ladder in order to compensate for E [p^). Finally we notice that the acceptance rates for the 
local move, when n> p, increases with higher values of the prior mean model size, showing that locally 
the algorithm moves more freely with E (p^) = 20 than with E (p^) = 5. 

B.3 Performance of ESSi and comparison with SSS 

Performance of ESSi 

We conclude this Section by discussing in details the overall performance of ESSz with respect to the 
selection of the true simulated effects. As a first measure of performance, we report for all the simulated 
examples the marginal posterior probabiUty of inclusion as described in G&McC (1997) and Hans et al. 
(2007). In the following, for ease of notation, we drop the chain subscript index and we exclusively refer 
to the first chain. To be precise, we evaluate the marginal posterior probabiUty of inclusion as: 



as before. Besides plotting the marginal posterior inclusion probability (A.12) averaged across sweeps 
and replicates for our simulated examples, we will also compute the interquartile range of (A.12) across 
replicates as a measure of variability. 

In order to thoroughly compare the proposed ESS algorithm to SSS (Hans et al., 2007), we present 
also some other measures of performance based on p (7 |y ) and : first we rank p (7 jy ) in decreasing 
order and record the indicator 7 that corresponds to the maximum and 1, 000 largest p (7 |y ) (after bum- 
in). Given the above set of latent binary vectors, we then compute the corresponding leading to "i?^: 
maxp (7 |y )" as well as the mean R^ over the 1, 000 largest p (7 \y), "i?^; 1, 000 largest p (7 |y )", both 
quantities averaged across replicates. Moreover the actual ability of the algorithm to reach regions of high 
posterior probabiUty and persist on them is monitored: given the sequence of the 1, 000 best 7s (based 




(A.12) 



with C = Ylit=i tP {y |7^*'' ) P (7*"*'') ^ number of sweeps after the burn-in. The posterior 
model size is similarly defined, p {pj \y) ~ C"^ J2t=i,...,T l(|7(t)|=p^) (7)^' {v |7^*^ ) P (7^*^)' with C 
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on p (7 \y)), the standard deviation of the corresponding R^s, shows how stable is the searching strategy 
at least for the top ranked (not unique) posterior probabilities: averaging over the repUcates, it provides 
an heuristic measures of "stabihty" of the algorithm. Finally we report the average computational time 
(in minutes) across replicates of ESSz written in Matlab code and run on a 2MHz CPU with 1.5 Gb RAM 
desktop computer and of SSS version 2.0 on the same computer. 

Comparison with SSS 

Figure 6 presents the marginal posterior probability of inclusion for ESSi with r = 1 averaged across 
replicates and, as a measure of variability, the interquartile range, blue left triangles and vertical blue solid 
line respectively. In general the covariates with non-zero effects have high marginal posterior probability 
of inclusion in all the examples: for example in Ex3, Figure 6 (a), the proposed ESSz algorithm, blue 
left triangle, is able to perfectly select the last 45 covariates, while the first 15, which do not contribute 
to y, receive small marginal posterior probability. It is interesting to note that this group of covariates, 
. . . , /3i5) = (0, . . . , 0), although correctly recognised having no influence on y, show some vari- 
ability across replicates, vertical blue solid Une: however, this is not surprising since independent priors 
are less suitable in situations where all the covariates are mildly-strongly correlated as in this simulated 
example. On the other hand the second set of covariates with small effects, (^Sie, • • • , /S30) = (1, • • • , 1)> 
are univocally detected. The ability of ESSz to select variables with small effects is also evident in Ex6, 
Figure 6 (d), where the two smallest coefficients, /32 = 0.112 and /3io = 0.950 (the second and last re- 
spectively from left to right), receive from high to very high marginal posterior probability (and similarly 
for the other rephcates, data not shown). In some cases however, some covariates attached with small 
effects are missed (e.g. Ex4, Figure 6 (b), the last simulated effect which is also the smallest, /3i6 = 0.5, 
is not detected). In this situation however the vertical blue sohd hne indicates that for some replicates, 
ESSi is able to assign small values of the marginal posterior probability giving evidence that ESSz fully 
explore the whole space of models. 

Superimposed on all pictures of Figure 5 are the median and interquartile range across replicates of 
p (7j = l\y), j = 1, . . . for SSS, red right triangles and vertical red dashed hne respectively. We 
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see that there is good agreement between the two algorithms in general, with in addition evidence that 
ESSi is able to explore more fully the model space and in particular to find small effects, leading to a 
posterior model size that is close to the true one. For instance in Ex3, Figure 6 (a), where the last 30 
covariates accounts for most of i?^, SSS has difficulty to detect {Pie, . . . , /Sso), while in Ex6, it misses 
P2 = 0.112, the smallest effect, and surprisingly also ^4 = —2.595 assigning a very small marginal 
posterior probability (and in general for the small effects in most replicates, data not shown). However 
the most marked difference between ESSi and SSS is present in Ex5: as for ESSi, SSS misses three 
effects of "model 1" but in addition ^4 = 1, = —1 and ^8 = 1-5 receive also very low marginal 
posterior probability, red right triangle, with high variability across replicates, vertical red dashed line. 
Moreover on the extreme left, as noted before, ESSz is able to capture the biggest coefficient of "model 
2" while SSS misses completely all contaminated effects. No noticeable differences between ESSi and 
SSS are present in Exl and Ex2 for the marginal posterior probability, while in Ex4, SSS shows more 
variability in ^(7^ = 1 |y) (red dashed vertical fines compared to blue solid vertical fines) for some 
covariates that do receive the highest marginal posterior probability. 

In contrast to the differences in the marginal posterior probability of inclusion, there is general agree- 
ment between the two algorithms with respect to some measures of goodness of fit and stability, see Table 
6. Again, not surprisingly, the main difference is seen in Ex5 where ESSi with r = 1 reaches a better 
i?^ both for the maximum and the 1, 000 largest p (7 \y). SSS shows more stabifity in all examples, but 
the last: this was somehow expected since one key features of SSS in its ability to move quickly towards 
the right model and to persist on it (Hans et al, 2007), but a drawback of this is its difficulty to explore 
far apart models with competing i?^ as in Ex5. Note that ESSi shows a small improvement of i?^ in all 
the simulated examples. This is related to the abifity of ESSi to pick up some of the small effects that 
are missed by SSS, see Figure 6. Finally ESSi shows a remarkable superiority in terms of computational 
time especially when the simulated (and estimated) is large (in other simulated examples, data not 
shown, we found this is always true when pj > 10): the explanation lies in the number of different 
models SSS and ESSi evaluate at each sweep. Indeed, SSS evaluates p + p^ip — p-y), where p^ is the 
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size of the current model, while ESSi theoretically analyses an equally large number of models, pL, but, 
when p > 77,, the actual number of models evaluated is drastically reduced thanks to our FSMH sampler. 
In only one case SSS beats ESSi in term of computational time (Ex5), but in this instance SSS clearly 
underestimates the simulated model and hence performs less evaluations than would be necessary to ex- 
plore faithfully the model space. In conclusion, we see that the rich porfolio of moves and the use of 
parallel chains makes ESS robust for tackUng complex covariate space as well as competitive against a 
state of the art search algorithm. 
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Figure 1: Top panels: (a) trace plot of the log posterior probability, logp (7 \y), and (b) model size, p^, 
across sweeps for the first real data example, eQTL analysis, using ESSg with r = 29 and FSMH as 
local move. Vertical dashed lines indicate the end of the burn-in. Bottom panels: (c) trace plot of the 
log posterior probability when MC3 is used as a local move; (d) kernel densities of logp (7 \ y) for the 
retained chain in the 25 replicates of the analysis when only FSMH and only MC3 are used as a local 
move respectively. Plot restricted to regions of high posterior probability. 



40 




0123456789 0123456789 



(a) (b) 

Figure 2: (a) Posterior model size for the first real data example, eQTL analysis: black solid line for 
ESS^ with T fixed at 29 and black dashed fine for ESS^ with Z-S prior, (b) Posterior model size for 
mQTL analysis, second real data example, using ESS^ with fixed and random r. 





Figure 4: Accumulated posterior mass as a function of the models recorded. Plot generated using 25 
replicates of the analysis of the first real data example and normalised by the total mass found by ESSg, 
T = 29, with only block crossover move (po = 0.25). 1-point and uniform crossover accumulate around 
90% of the total mass accumulated by ESS^ with only block crossover, while adaptive crossover only 
85%. 
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(c) (d) 

Figure 5: For ESSi with r = 1: (a) trace plot of the log posterior probability, logp (7 \ y), and (b) model 
size, p^, across sweeps for one replicate of Exl with E (p^) = 20, top panels and Ex4, bottom panels. 
Vertical dashed lines indicate the end of the bum-in. 
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Figure 6: Median and interquartile range of the marginal posterior probability of inclusion (A. 12) for 
Ex3, (a), Ex4, (b) and Ex5, (c), across replicates. Each graph is constructed as follows: bottom part, 
pairwise squared correlation (Xj, Xji), j = 1, . . . ,p, between predictors for one selected replicate, 
grey scale indicates different values of squared correlation; blue left and red right triangles, median 
of p (7j = 1 |y ) across replicates for ESSi with r = 1 and SSS respectively; vertical blue solid lines 
and vertical red dashed lines, interquartile range of p^-yj = l\y) across replicates for ESSi and SSS 
respectively; upper and lower green triangles, simulated models. Selected replicate of Ex6, (d), shows 
marginal posterior probability of inclusion (blue left and red right triangles for ESSi r = 1 and SSS 
respectively). Marginal posterior probability of inclusion lower than 0.025 not shown. 
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Mode{pj\y) E{T\y) R^* B^** Stability 


ESSg, r = 29 

eQTL 

ESSg withp(r) 


2 - 0.716 0.704 0.257 
2 20.576 0.716 0.689 0.099 

2 - 0.843 0.843 ^ 
2 63.577 0.843 0.843 «0 


ESSg, r = 50 

mQTL 

ESSg withp(r) 




Crossover DR Exchange ALL Exchange Acc. rate r Time (min.) 


ESSg, r = 29 

eQTL 

ESSg with p (r) 


0.214 0.534 0.671 - 28 
0.243 0.585 0.711 0.438 30 
0.214 0.514 0.669 - 302 
0.226 0.571 0.717 0.434 309 


ESSg, r = 50 

mQTL 

ESSg with p{t) 



Table 1: Performance of ESS^ with and without the hyperprior on r for the first real data example, eQTL 
analysis, and second example, mQTL analysis. R^* and R^** correspond to "R^: maxp (7 and 
"i?^: 1, 000 largest p (7 |y)" respectively. The former indicates the coefficient of determination for the 
(first chain) best model visited according to the posterior probability p (7 |y ), while the latter shows the 
average i?^ for the (first chain) top 1 , 000 (not unique) visited models ranked by the posterior probability. 
"Stability" is defined as the standard deviation of i?^ for the (first chain) top 1, 000 (not unique) visited 
models (smaller values indicate better performance of the algorithm). In the bottom part of the Table, 
acceptance rate for specific moves are given. "DR Exchange" and "ALL Exchange" stands for "delayed 
rejection exchange" and "all-exchange" move respectively. 
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Version of ESS 


r p{t) 


Experiment (i) 


ESSg with only FSMH 
ESSg with only MC^ 


68% 88% 
28% 40% 


Experiment (ii) 


ESSg with only 1 -point crossover 
ESS^ with only block crossover 
ESS^ with only uniform crossover 
ESS^ with only adaptive crossover 


64% 80% 
80% 84% 
60% 84% 
60% 76% 



Table 2: Proportion of times different versions of ESS^ reach the same top visited model in the eQTL 
real data set with or without an hyperprior on r in 25 replicates of the analysis. 





Version of ESS^ 


r p{t) 


Experiment (ii) 


ESSg' with only 1-point crossover 
ESSg with only block crossover 
ESSg with only uniform crossover 
ESSg with only adaptive crossover 


0.303 0.335 
0.482 0.501 
0.026 0.042 
0.013 



Table 3: Average acceptance rate of the crossover operator for different versions of ESSg in 25 rephcates 
of the analysis of the first real data example with or without an hyperprior on r. 
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Vi-vA 






n 




120 




300 


120 


120 


200 


120 






60 




30 


60 


300 


1, 000 


775 




r. 



1 n 

lU 


zu 





r. 



K 



K 






Add/delete 


0.036 
(0.016) 


0.054 
(0.017) 


0.098 
(0.023) 


0.066 
(0.020) 


0.086 
(0.031) 


- 


- 


- 


Swap 


0.063 


0.100 


0.165 


0.070 


0.106 








(0.015) 


(0.019) 


(0.022) 


(0.015) 


(0.053) 








Crossover 


0.249 

(0.021) 


0.270 

(0.029) 


0.271 

(0.036) 


0.157 

(0.018) 


0.215 

(0.022) 


0.147 

(0.028) 


0.170 

(0.023) 


0.193 

(0.028) 


DR Exchange 


0.500 
(0.040) 


0.493 
(0.043) 


0.500 
(0.040) 


0.582 
(0.020) 


0.492 
(0.071) 


0.517 
(0.105) 


0.505 
(0.013) 


0.497 
(0.072) 



Table 4: Mean and standard deviation in brackets of EMC acceptance rates across replicates for ESSz 
with T = 1. "DR Exchange" stands for "delayed rejection exchange". 









Exl 




Ex2 


Ex3 


Ex4 


Ex5 


Ex6 


n 






120 




300 


120 


120 


200 


120 


P 






60 




30 


60 


300 


1,000 


775 






5 


10 


20 


5 


5 


5 


5 


5 




/ = 1 


0.157 


0.137 


0.110 


0.065 


0.160 


0.180 


0.201 


0.214 




1 = 2 


0.250 


0.232 


0.204 


0.185 


0.271 


0.276 


0.300 


0.316 


Swapping 


Z = 3 


0.220 


0.220 


0.223 


0.255 


0.245 


0.223 


0.231 


0.231 




Z = 4 


0.240 


0.252 


0.280 


0.293 


0.215 


0.206 


0.182 


0.167 




l = h 


0.142 


0.160 


0.182 


0.201 


0.110 


0.112 


0.083 


0.070 




I = 1,2 


1.360 


1.600 


2.101 


2.680 


1.350 


0.733 


0.569 


0.526 




1 = 2,3 


1.570 


1.570 


1.600 


0.870 


1.430 


1.021 


0.913 


0.893 


Overlapping 




















I = 3,4 


1.400 


1.290 


1.050 


0.600 


2.111 


1.329 


1.491 


1.696 




Z = 4,5 


1.100 


0.992 


0.690 


1.251 


4.131 


1.503 


2.304 


2.499 



Table 5: Swapping probabiUty for ESSz with r = 1 defined as the observed frequency of successful 
swaps for each chain (including delayed rejection exchange and all-exchange operators) averaged across 
replicates. Overlapping measure defined as V (/ (7/)) (l/t/+i - l/inf, Liang and Wong (2000) with 
/ (7;) = logp (y |7i ) + logp (7/). Target value for consecutive chains is O (1). 
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Exl 




Ex2 


Ex3 


Ex4 


Ex5 


Ex6 




n 




120 




300 


120 


120 


200 


120 




P 




60 




30 


60 


300 


1,000 


775 






5 


10 


20 


5 


5 


5 


5 


5 




p2 * 


0.864 


0.867 


0.871 


0.975 


1 


0.962 


0.703 


0.997 




(0.029) 


(0.027) 


(0.023) 


(0.003) 


(~ 0) 


(0.011) 


(0.043) 


(0.005) 


ESSi, 




0.863 
(0.027) 


0.866 
(0.026) 


0.874 
(0.023) 


0.975 
(0.003) 


1 

(~ 0) 

\ / 


0.957 
(0.014) 


0.689 
(0.048) 


0.997 
(0.003) 


T=l 


Stability 


0.003 
(0.001) 


0.003 
(0.002) 


0.005 
(0.002) 


^ 
(~ 0) 


0) 
(~ 0) 


0.005 
(0.004) 


0.015 
(0.007) 


0.002 
(0.002) 




Time (min.) 


6 

« 1) 


6 

« 1) 


7 

(< 1) 


16 

(< 1) 


18 

(1) 


166 

(32) 


338 

(43) 


202 

(40) 




r?2 * 

-ft-y 


0.863 


0.867 


0.870 


0.975 


PS 1 


0.956 


0.577 


0.997 




(0.027) 


(0.025) 


(0.024) 


(0.003) 


(« 0) 


(0.016) 


(0.074) 


(0.004) 


sss 




0.863 
(0.027) 


0.867 
(0.025) 


0.870 
(0.024) 


0.975 
(0.003) 


0.999 
(«. 0) 


0.955 
(0.016) 


0.565 
(0.078) 


0.996 
(0.004) 


Stability 




(0) 




(0) 



0) 


^ 
0) 




0) 


0.001 
(0.002) 


0.009 
(0.015) 


0.004 
(0.006) 




Time (min.) 


12 

(1) 


12 

(2) 


13 

(2) 


118 

(26) 


497 

(75) 


502 

(241) 


169 

(81) 


549 

(159) 



Table 6: Comparison between ESSi with r = 1 and SSS for the six simulated examples. Standard 
deviation in brackets. * and i?^ ** correspond to "R^ : max p (7 1 y )" and "i?^ : 1 , 000 largest p (7 | y )" 
respectively. 
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